Abstract 

We propose a new algorithmic framework for sequential hypothesis testing with i.i.d. data, 
which includes A/B testing, nonparametric two-sample testing, and independence testing as 
special cases. It is novel in several ways: (a) it takes linear time and constant space to compute 
on the fly, (b) it has the same power guarantee as a non-sequential version of the test with the 
same computational constraints up to a small factor, and (c) it accesses only as many samples 
as are required - its stopping time adapts to the unknown difficulty of the problem. All our 
test statistics are constructed to be zero-mean martingales under the null hypothesis, and the 
rejection threshold is governed by a uniform non-asymptotic law of the iterated logarithm (LIL). 
For the case of nonparametric two-sample mean testing, we also provide a finite sample power 
analysis, and the first non-asymptotic stopping time calculations for this class of problems. We 
verify our predictions for type I and II errors and stopping times using simulations. 


1 Introduction 

Nonparametric statistical decision theory poses the problem of making a decision between a null 
($H_0$) and an alternative ($H_1$) hypothesis over a dataset with the aim of controlling both false positives 
and false negatives (in statistical terms, maximizing power while controlling type I error), all without 
making assumptions about the distribution of the data being analyzed. Hypothesis testing is based 
on a “stochastic proof by contradiction”: the null hypothesis is assumed true by default, 
and is rejected only if the observed data are statistically very unlikely under the null. 

There is increasing interest in solving such problems in a “big data” regime, in which the sample 
size N can be huge. We present a sequential testing framework for this problem that is particularly 
suitable for two related scenarios prevalent in many applications: 

1) The dataset is extremely large and high-dimensional, so even a single pass through it is 
prohibitive. 

2) The data is arriving as a stream, and decisions must be made with minimal storage. 

Sequential tests have long been considered strong in such settings. They access the data in an 
online/streaming fashion, assessing after every new datapoint whether they then have enough evidence 
to reject the null hypothesis. However, most prior work is either univariate or parametric or 
asymptotic, while we are the first to provide non-asymptotic guarantees on multivariate nonparametric 
problems. 
To elaborate on our motivations, suppose we have a gigantic amount of data from each of 
two unknown distributions, enough to detect even a minute difference in their means $\mu_1 - \mu_2$ if 
it exists. Further suppose that, unknown to us, deciding whether the means are equal is actually 
statistically easy ($|\mu_1 - \mu_2|$ is large), meaning that one can conclude $\mu_1 \neq \mu_2$ with high confidence 
by just looking at a tiny fraction of the dataset. Can we take advantage of this easiness, despite 
our ignorance of it? 

A naive solution would be to discard most of the data and run a batch (offline) test on a small 
subset. However, we do not know how hard the problem is, and hence do not know how large 
a subset will suffice — sampling too little data might lead to incorrectly not rejecting the null, 
and sampling too much would unnecessarily waste computational resources. If we somehow knew 
$\mu_1 - \mu_2$, we would want to choose the smallest number of samples (say $n^*$) to reject the null while 
controlling type I error at some target level. 

Our sequential test solves the problem by automatically stopping after seeing about n* samples, 
while still controlling type I and II errors almost as well as the equivalent linear-time batch test. 
Without knowing the true problem difficulty, we are able to detect it with virtually no computational 
or statistical penalty. We devise and formally analyze a sequential algorithm for a variety of 
problems, starting with a basic test of the bias of a coin, then nonparametric two-sample mean 
testing, and finally general nonparametric two-sample and independence testing. 

Our proposed procedure only keeps track of a single scalar test statistic, which we construct 
to be a zero-mean random walk under the null hypothesis. It is used to test the null hypothesis 
each time a new data point is processed. A major statistical issue is dealing with the apparent 
multiple hypothesis testing problem - if our algorithm observes its first rejection of the null at 
time $t$, it might raise suspicions of being a false rejection, because $t - 1$ hypothesis tests were 
already conducted and the $t$-th may have been rejected purely by chance. Applying some kind of 
multiple testing correction, like the Bonferroni or Benjamini-Hochberg procedure, is exceedingly 
conservative and produces very suboptimal results over a large number of tests. However, since 
the random walk moves only a relatively small amount every iteration, the tests are far from 
independent. Formalizing this intuition requires adapting a classical probability result, the law of 
the iterated logarithm (LIL), with which we control for type I error (when $H_0$ is true). 

The LIL can be described as follows: imagine tossing a fair coin, assigning $+1$ to heads and $-1$ 
to tails, and keeping track of the sum $S_t$ of $t$ coin flips. The LIL asserts that asymptotically, $S_t$ 
always remains bounded between $\pm\sqrt{2t \ln\ln t}$ (and this “envelope” is tight). 
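
To make the envelope concrete, the following sketch (ours, not part of the original analysis) simulates a fair $\pm 1$ walk and reports the largest observed value of $|S_t| / \sqrt{2t \ln\ln t}$ over a late stretch of time. The function names, the starting time and the seed are illustrative assumptions. Since the LIL is an asymptotic statement, a finite simulation can only suggest, not confirm, that this ratio settles near or below 1.

```python
import math
import random


def lil_envelope(t: int) -> float:
    """Classical LIL envelope sqrt(2 t ln ln t); needs t >= 3 so that ln ln t > 0."""
    return math.sqrt(2.0 * t * math.log(math.log(t)))


def max_envelope_ratio(num_flips: int, start: int = 1000, seed: int = 0) -> float:
    """Largest |S_t| / envelope(t) observed for start <= t <= num_flips."""
    rng = random.Random(seed)
    s, worst = 0, 0.0
    for t in range(1, num_flips + 1):
        s += 1 if rng.random() < 0.5 else -1  # fair coin: +1 heads, -1 tails
        if t >= start:
            worst = max(worst, abs(s) / lil_envelope(t))
    return worst


if __name__ == "__main__":
    # Illustrative run; the printed ratio depends on the seed.
    print(max_envelope_ratio(200_000))
```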

When $H_1$ is true, we prove that the sequential algorithm does not need the whole dataset as 
a batch algorithm would, but automatically stops after processing just “enough” data points to 
detect $H_1$, depending on the unknown difficulty of the problem being solved. The near-optimal 
nature of this adaptive type II error control (when $H_1$ is true) is again due to the remarkable LIL. 

As mentioned earlier, all of our test statistics can be thought of as random walks, which behave 
like $S_t$ under $H_0$. The LIL then characterizes how these random walks behave under $H_0$: our 
algorithm will keep observing new data since the random walk values will simply bounce around 
within the LIL envelope. Under $H_1$, this random walk is designed to have nonzero mean, and hence 
will eventually stray outside the LIL envelope, at which point the process stops and rejects the null 
hypothesis. 

For practically applying this argument to finite samples and reasoning about type II error 
and stopping times, we cannot use the classical asymptotic form of the LIL typically stated in 
textbooks such as [7]; instead, we adapt a finite-time extension of the LIL by [2]. As we will see, 
this technical contribution is necessary to investigate the stopping time, and to control type I and II 
errors non-asymptotically and uniformly over all $t$. 

In summary, our sequential testing framework has the following properties: 

(A) Under $H_0$, it controls type I error, using a finite-time LIL computable in terms of empirical 
variance. 

(B) Under $H_1$, and with type II error controlled at a target level, it automatically stops after 
seeing about the same number of points as the corresponding computationally-constrained oracle 
batch algorithm. 

(C) Each update takes $O(d)$ time and constant memory. 

In later sections, we develop formal versions of these statements. The statistical observations, 
particularly the stopping time, follow from the finite-time LIL through simple concentration of 
measure arguments that extend to very general sequential testing settings, but have seemingly 
remained unobserved in the literature for decades because of the finite-time LIL necessary to make 
them. 

We begin by describing a sequential test for the bias of a coin in Section 2. We then provide 
a sequential test for nonparametric two-sample mean testing in Section 3. We run extensive 
simulations in Section 4 to bear out our theory about its properties. We end with extensions to the 
general nonparametric two-sample and independence testing problems, in Section 5. Proofs are 
deferred to the appendices. 


Batch test:
    Fix $N$ and compute $p_N$
    if $S_N > p_N$ then
        Reject $H_0$
    else
        Fail to reject $H_0$

Sequential test:
    Fix $N$
    for $n = 1$ to $N$ do
        Compute $q_n$
        if $S_n > q_n$ then
            Reject $H_0$ and return
    Fail to reject $H_0$


Figure 1: Batch (left) and sequential (right) tests. 
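
The two templates of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function names are hypothetical, the statistic is any running sum of scalar increments, and the thresholds are supplied by the caller (the constant threshold in the demo is a placeholder, not a calibrated $q_n$).

```python
from typing import Callable, Iterable, Optional


def batch_test(increments: Iterable[float], threshold: float) -> bool:
    """Fig. 1 (left): sum all N increments, then compare S_N with p_N once."""
    return sum(increments) > threshold


def sequential_test(
    increments: Iterable[float], threshold: Callable[[int], float]
) -> Optional[int]:
    """Fig. 1 (right): return the first n with S_n > q_n, or None if H_0 is never rejected."""
    s = 0.0
    for n, x in enumerate(increments, start=1):
        s += x                    # constant-memory update of the statistic
        if s > threshold(n):      # q_n is evaluated at every step
            return n              # reject H_0 and stop early
    return None                   # fail to reject H_0


if __name__ == "__main__":
    data = [1, 1, -1, 1, 1, 1, 1, 1]
    print(batch_test(data, threshold=5.0))
    # Placeholder constant threshold, for illustration only.
    print(sequential_test(data, threshold=lambda n: 3.5))
```

The sequential version touches each data point once and keeps only the scalar `s`, which is what makes early stopping possible without storing the stream.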


2 Detecting the Bias of a Coin 

This section will illustrate how a simple sequential test can perform statistically as well as the 
best batch test in hindsight, while automatically stopping essentially as soon as possible. We will 
show that such early stopping can be viewed as quite a general consequence of concentration of 
measure. Just for this section, let $K, K_1, K_2$ represent constants that may take different values on 
each appearance, but are always absolute. 

Consider observing i.i.d. binary flips $A_1, A_2, \ldots \in \{-1, +1\}$ of a coin, which may be fair or 
biased towards $+1$, with $P(A_i = +1) = p$. We want to test for fairness, detecting unfairness as 
soon as possible. Concretely, we therefore wish to test, for $\delta \in (0, \frac{1}{2}]$: 

$$H_0 : p = \frac{1}{2} \quad \text{vs.} \quad H_1(\delta) : p = \frac{1}{2} + \delta$$

For any sample size $n$, the natural test statistic for this problem is $S_n = \sum_{i=1}^{n} A_i$, which is a 
(scaled) simple mean-zero random walk under $H_0$. A standard hypothesis testing approach to our 
problem is a basic batch test involving $S_N$, which tests for deviations from the null for a fixed 
sample size $N$ (Fig. 1, left). A basic Hoeffding bound shows that 

$$S_N \le \sqrt{\frac{N}{2} \ln \frac{1}{\alpha}} =: p_N$$

with probability $\ge 1 - \alpha$ under the null, so type I error is controlled at level $\alpha$: 

$$P_{H_0}(\text{reject } H_0) = P_{H_0}(S_N > p_N) \le e^{-2 p_N^2 / N} = \alpha.$$

2.1 A Sequential Test 

The main test we propose will be a sequential test as in Fig. 1 (right). It sees examples as they arrive 
one at a time, up to a large time $N$, the maximum sample size we can afford. The sequential 
test is defined with a sequence of positive thresholds $\{q_n\}_{n \in [N]}$. We show how to set $q_n$ to justify 
statements (A) and (B) in the introduction. 

Type I Error. Just as the batch threshold $p_N$ is determined by controlling the type I error with 
a concentration inequality, the sequential test also chooses $q_1, \ldots, q_N$ to control the type I error at 
$\alpha$: 

$$P_{H_0}(\text{reject } H_0) = P_{H_0}(\exists n \le N : S_n > q_n) \le \alpha \tag{1}$$


This inequality concerns the uniform concentration over infinite tails of $S_n$, but what $\{q_n\}_{n \in [N]}$ 
satisfies it? Asymptotically, the answer is governed by a foundational result, the LIL: 

Theorem 1 (Law of the iterated logarithm ([13])). With probability 1, $\limsup_{n \to \infty} \frac{S_n}{\sqrt{n \ln\ln n}} = \sqrt{2}$. 

The LIL says that $q_n$ should have a $\sqrt{n \ln\ln n}$ asymptotic dependence on $n$, but does not specify 
its $\alpha$ dependence. 

Our sequential testing insights rely on a stronger non-asymptotic LIL proved in ([2], Theorem 2): 
w.p. at least $1 - \alpha$, we have $|S_n| \le \sqrt{K n \ln\left(\frac{\ln n}{\alpha}\right)} =: q_n$ simultaneously for all $n \ge K \ln\left(\frac{4}{\alpha}\right) =: n_0$. 
This choice of $q_n$ satisfies (1) for $n_0 \le n \le N$, and specifies the sequential test as in Fig. 1. 
(Choosing $q_n$ this way is unimprovable in all parameters up to absolute constants ([2]).) 


Type II Error. For practical purposes, $\sqrt{\ln\ln n} \le \sqrt{\ln\ln N}$ can be treated as a small constant 
(even when $N = 10^{20}$, $\sqrt{\ln\ln N} < 2$). Hence, $q_N \approx p_N$ (more discussion in Appendix D.1), and the 
power is: 

$$P_{H_1(\delta)}(\exists n \le N : S_n > q_n) \ge P_{H_1(\delta)}(S_N > q_N) \tag{2}$$

$$\approx P_{H_1(\delta)}(S_N > p_N) \tag{3}$$

So the sequential test is essentially as powerful as a batch test with $N$ samples (and similarly the 
$n$-th round of the sequential test is like an $n$-sample batch test). 
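
The claim that the iterated-logarithm factor is effectively a small constant is easy to check numerically. The helper name below is ours; the sketch simply evaluates $\sqrt{\ln\ln N}$ over a wide range of $N$.

```python
import math


def sqrt_lnln(n: float) -> float:
    """sqrt(ln ln n), the factor separating LIL-type and Hoeffding-type thresholds."""
    return math.sqrt(math.log(math.log(n)))


if __name__ == "__main__":
    for exponent in (3, 6, 9, 20):
        print(f"N = 1e{exponent}: sqrt(ln ln N) = {sqrt_lnln(10.0 ** exponent):.3f}")
```

Even at $N = 10^{20}$ the factor stays below 2 (roughly 1.96), which is the sense in which $q_N \approx p_N$.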

Early Stopping. The standard motivation for using sequential tests is that they often require few 
samples to reject statistically distant alternatives. To investigate this with our working example, 
suppose $N$ is large and the coin is actually biased, with a fixed unknown $\delta > 0$. Then, if we 
somehow had full knowledge of $\delta$ when using the batch test and wanted to ensure a desired type II 
error $\beta < 1$, we would use just enough samples $n^*_\beta(\delta)$ (written as $n^*$ in context): 

$$n^*_\beta(\delta) = \min \{ n : P_{H_1(\delta)}(S_n \le p_n) \le \beta \} \tag{4}$$

so that for all $n \ge n^*_\beta(\delta)$, since $p_n = o(n)$, 

$$\beta \ge P_{H_1(\delta)}(S_n \le p_n) = P_{H_1(\delta)}(S_n - n\delta \le p_n - n\delta) \ge P_{H_1(\delta)}(S_n - n\delta \le -Kn\delta) \tag{5}$$

Examining (5), note that $S_n - n\delta$ is a mean-zero random walk. Therefore, standard lower bounds 
for the binomial tail tell us that $n^*_\beta(\delta) \ge \frac{K \ln(1/\beta)}{\delta^2}$ suffices, and no test can statistically use many 
fewer than $n^*_\beta(\delta)$ samples under $H_1(\delta)$ to control type II error at $\beta$. 

How many samples does the sequential test use? The quantity of interest is the test’s stopping 
time $\tau$, which is $\le N$ when it rejects $H_0$ and $N$ otherwise. In fact, the expected stopping time is 
close to $n^*$ under any alternate hypothesis: 

Theorem 2. For any $\delta$ and any $\beta > 0$, there exist absolute constants $K_1, K_2$ such that 

$$\mathbb{E}_{H_1}[\tau] \le \left(1 + \frac{K_1 \beta^{K_2}}{\ln(1/\beta)}\right) n^*_\beta(\delta)$$


Theorem 2 shows that the sequential test stops roughly as soon as we could hope for, under any 
alternative $\delta$, despite our ignorance of $\delta$! We will revisit these ideas when presenting our two-sample 
sequential test later in Section 3.1. 
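
The adaptivity of the stopping time can be seen in a small simulation. The sketch below is only illustrative: the threshold $\sqrt{2n(\ln\ln n + \ln(1/\alpha))}$ uses constants we chose for the demo (the text only fixes $q_n$ up to an absolute constant $K$), the iterated logarithm is truncated so that it is defined for small $n$, and all function names are hypothetical. Flips are $\pm 1$ with $P(+1) = \frac{1}{2} + \delta$.

```python
import math
import random

E_TO_E = math.e ** math.e


def q(n: int, alpha: float) -> float:
    """Illustrative LIL-style threshold; the constants here are assumed, not the paper's K."""
    lnln = math.log(math.log(max(n, E_TO_E)))  # truncated so it is defined for small n
    return math.sqrt(2.0 * n * (lnln + math.log(1.0 / alpha)))


def stopping_time(delta: float, alpha: float, max_n: int, rng: random.Random) -> int:
    """Run the sequential coin test once; return tau (max_n if H_0 is never rejected)."""
    s = 0
    for n in range(1, max_n + 1):
        s += 1 if rng.random() < 0.5 + delta else -1
        if s > q(n, alpha):
            return n
    return max_n


def mean_stopping_time(
    delta: float, alpha: float = 0.05, max_n: int = 100_000, trials: int = 200, seed: int = 0
) -> float:
    """Average stopping time over independent runs with a fixed seed."""
    rng = random.Random(seed)
    return sum(stopping_time(delta, alpha, max_n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for delta in (0.4, 0.2, 0.1):
        print(delta, mean_stopping_time(delta))
```

The same code, with no knowledge of $\delta$, should stop much sooner for large biases than for small ones, in line with the roughly $1/\delta^2$ scaling of $n^*_\beta(\delta)$.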


2.2 Discussion 

Before moving to the two-sample testing setting, we note the generality of these ideas. Theorem 2 
is proved for biased coin flips, but it uses only basic concentration of measure ideas: upper and 
lower bounds on the tails of a statistic that is a cumulative sum incremented each timestep. Many 
natural test statistics follow this scheme, particularly those that can be efficiently updated on the 
fly. Our main sequential two-sample test in the next section does also. 

Theorem 2 is notable for its uniformity over $\delta$ and $\beta$. Note that $q_n$ (and therefore the sequential 
test) is independent of both of these; we need only set a target type I error bound $\alpha$. Under 
any alternative $\delta > 0$, the theorem holds for all $\beta$ simultaneously. As $\beta$ decreases, $n^*_\beta(\delta)$ of course 
increases, but the leading multiplicative factor $\left(1 + \frac{K_1 \beta^{K_2}}{\ln(1/\beta)}\right)$ decreases. In fact, with an increasingly 
stringent $\beta \to 0$, we see that $\frac{\mathbb{E}_{H_1}[\tau]}{n^*_\beta(\delta)} \to 1$; so the sequential test in fact stops closer to $n^*$, and 
hence $\tau$ is almost deterministically best possible. Indeed, the proof of Theorem 2 also shows that 
$P_{H_1}(\tau \ge n) \le e^{-Kn\delta^2}$, so the probability of lasting $n$ steps falls off exponentially in $n$, and is 
therefore quite sharply concentrated near the optimum $n^*_\beta(\delta)$. 

This precise line of reasoning is formalized completely non-asymptotically in the analysis of 
our main two-sample test for the problem (Theorem 6), though that result is in a stronger high-dimensional 
setting. 









3 Two-Sample Mean Testing 


Assume that we have samples $X_1, \ldots, X_n, \ldots \sim P$ and $Y_1, \ldots, Y_n, \ldots \sim Q$, with $P, Q$ being unknown 
arbitrary continuous distributions on $\mathbb{R}^d$ with means $\mu_1 = \mathbb{E}_{X \sim P}[X]$, $\mu_2 = \mathbb{E}_{Y \sim Q}[Y]$, and we need 
to test 

$$H_0 : \mu_1 = \mu_2 \quad \text{vs.} \quad H_1 : \mu_1 \ne \mu_2 \tag{6}$$

Denote covariances of $P, Q$ by $\Sigma_1, \Sigma_2$ and $\Sigma := \frac{1}{2}(\Sigma_1 + \Sigma_2)$. Define $\delta := \mu_1 - \mu_2$ so that $\delta = 0$ 
under $H_0$. Let $\Phi(\cdot)$ denote the standard Gaussian CDF, and $[\ln\ln]_+(x) := \ln\ln[\max(x, e^e)]$. 

3.1 A Linear-Time Sequential Test 

In this section, we present our main sequential two-sample test using the scheme in Fig. 1, so we 
only need to specify a sequence of rejection thresholds $q_n$. To do this, we denote 

$$h_i = (X_{2i-1} - Y_{2i-1})^\top (X_{2i} - Y_{2i})$$

and define our sequential test statistic as the following stochastic process evolving with $n$: 

$$T_n = \sum_{i=1}^{n} h_i.$$

Under $H_0$, $\mathbb{E}[h_i] = 0$, and $T_n$ is a zero-mean random walk. 
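
The statistic $T_n$ can be maintained on the fly. The following sketch (function names are ours) consumes two streams of $d$-dimensional points, two points from each stream per step, and updates the running sum with one $O(d)$ inner product; only the scalar $T_n$ is carried between steps.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Vector = Sequence[float]


def h_increment(x1: Vector, y1: Vector, x2: Vector, y2: Vector) -> float:
    """h_i = (X_{2i-1} - Y_{2i-1})^T (X_{2i} - Y_{2i}), computed in O(d) time."""
    return sum((a - b) * (c - e) for a, b, c, e in zip(x1, y1, x2, y2))


def running_statistic(
    xs: Iterable[Vector], ys: Iterable[Vector]
) -> Iterator[Tuple[int, float]]:
    """Yield (n, T_n) after each pair of observations from both streams."""
    x_iter, y_iter = iter(xs), iter(ys)
    t, n = 0.0, 0
    while True:
        try:
            x1, y1 = next(x_iter), next(y_iter)
            x2, y2 = next(x_iter), next(y_iter)
        except StopIteration:
            return  # a stream ran out before a full pair was available
        n += 1
        t += h_increment(x1, y1, x2, y2)
        yield n, t
```

For example, `list(running_statistic(xs, ys))` on finite lists gives the whole trajectory of $T_n$, while a sequential test would instead compare each yielded value with its threshold and stop at the first crossing.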

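Proposition 1. $\mathbb{E}[T_n] = n\mathbb{E}[h] = n\|\delta\|^2$, and 

$$\mathrm{var}(T_n) = n\,\mathrm{var}(h) = n(4\,\mathrm{tr}(\Sigma^2) + 4\delta^\top \Sigma \delta) =: nV_1,$$

which reduces to $nV_0 := 4n\,\mathrm{tr}(\Sigma^2)$ under $H_0$ (where $\delta = 0$). 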

We assume for now that our data are bounded, i.e. 

$$\|X\|, \|Y\| \le 1/2,$$

so that by the Cauchy-Schwarz inequality, w.p. 1, 

$$|T_n - T_{n-1}| = |(X_{2n-1} - Y_{2n-1})^\top (X_{2n} - Y_{2n})| \le 1.$$

Since $T_n$ has bounded differences, it exhibits Gaussian-like concentration under the null. We 
examine the cumulative variance process of $T_n$ under $H_0$, 

$$\sum_{i=1}^{n} \mathbb{E}[(T_i - T_{i-1})^2 \mid h_{1:(i-1)}] = \sum_{i=1}^{n} \mathrm{var}(h_i) = nV_0.$$

Using this, we can control the behavior of $T_n$ under $H_0$. 

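Theorem 3 ([2]). Take any $\xi > 0$. Then with probability $\ge 1 - \xi$, for all $n$ simultaneously, 

$$|T_n| < C_0(\xi) + \sqrt{2 C_1 n V_0 [\ln\ln]_+(n V_0) + C_1 n V_0 \ln\left(\frac{4}{\xi}\right)}$$

where $C_0(\xi) = 3(e - 2)e^2 + 2\left(1 + \sqrt{\frac{1}{3}}\right) \ln\left(\frac{8}{\xi}\right)$ and $C_1 = 6(e - 2)$. 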
Unfortunately, we cannot use the theorem directly to get computable deviation bounds for type 
I error control, because the covariance matrix $\Sigma$ is unknown a priori. $nV_0$ must instead be estimated 
on the fly as part of the sequential test, and its estimate must be concentrated tightly and uniformly 
over time, so as not to present a statistical bottleneck if the test runs for a long time. We prove such 
a result, necessary for sequential testing, relating $nV_0$ to the empirical variance process $\hat V_n = \sum_{i=1}^{n} h_i^2$. 

Lemma 4. With probability $\ge 1 - \xi$, for all $n$ simultaneously, there is an absolute constant $C_3$ 
such that 

$$nV_0 \le C_3(\hat V_n + C_0(\xi))$$

Its proof uses a self-bounding argument and is in the Appendix. Now, we can combine these to 
prove a novel uniform empirical Bernstein inequality to (practically) establish concentration of $T_n$ 
under $H_0$. 

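Theorem 5 (Uniform Empirical Bernstein Inequality for Random Walks). Take any $\xi > 0$. Then 
with probability $\ge 1 - \xi$, for all $n$ simultaneously, 

$$|T_n| < C_0(\xi) + \sqrt{2 V_n^* \left([\ln\ln]_+(V_n^*) + \ln\left(\frac{4}{\xi}\right)\right)}$$

where $V_n^* := C_3(\hat V_n + C_0(\xi))$, $C_0(\xi) = 3(e - 2)e^2 + 2\left(1 + \sqrt{\frac{1}{3}}\right) \ln\left(\frac{8}{\xi}\right)$, and $C_3$ is an absolute constant. 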

Its proof follows immediately from a union bound on Thm. 3 and Lem. 4. Thm. 5 depends on 
$\hat V_n$, which is easily calculated by the algorithm on the fly in constant time per iteration ([8]). Ignoring 
constants for clarity, Thm. 5 effectively implies that our sequential test from Figure 1 controls type 
I error at $\alpha$ by setting 

$$q_n \propto \ln\left(\frac{1}{\alpha}\right) + \sqrt{2 \hat V_n \ln\left(\frac{\ln \hat V_n}{\alpha}\right)} \tag{7}$$


Practically, we suggest using the above threshold with a constant of 1.1 to guarantee type I error 
approximately $\alpha$ (this is all one often wants, since any particular choice such as $\alpha = 0.05$ is 
arbitrary anyway). This is what we do in our experiments, with excellent success in simulations. 
For exact or conservative control, consider using a small constant multiple of the above threshold, 
such as 2. 
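
Putting the pieces together, a sequential two-sample mean test in the shape of Eq. (7) might look as follows. This is a sketch under stated assumptions rather than the exact procedure used in the experiments: the constant `c = 1.1` follows the suggestion above, we write $\ln(\ln \hat V_n / \alpha)$ as $\ln\ln \hat V_n + \ln(1/\alpha)$ and truncate the iterated logarithm as in $[\ln\ln]_+$, rejection is one-sided as in Fig. 1, and all function names are ours.

```python
import math
from typing import Iterable, Optional, Sequence

E_TO_E = math.e ** math.e


def threshold(v_hat: float, alpha: float, c: float = 1.1) -> float:
    """c * (ln(1/alpha) + sqrt(2 V (lnln_+(V) + ln(1/alpha)))), the shape of Eq. (7)."""
    lnln = math.log(math.log(max(v_hat, E_TO_E)))  # [lnln]_+ truncation (assumed here)
    log_inv_alpha = math.log(1.0 / alpha)
    return c * (log_inv_alpha + math.sqrt(2.0 * v_hat * (lnln + log_inv_alpha)))


def sequential_two_sample_test(
    xs: Iterable[Sequence[float]],
    ys: Iterable[Sequence[float]],
    alpha: float = 0.05,
    c: float = 1.1,
) -> Optional[int]:
    """Return the number of pairs n at which H_0 is rejected, or None if never."""
    x_iter, y_iter = iter(xs), iter(ys)
    t = v_hat = 0.0
    n = 0
    while True:
        try:
            x1, y1 = next(x_iter), next(y_iter)
            x2, y2 = next(x_iter), next(y_iter)
        except StopIteration:
            return None  # data exhausted: fail to reject H_0
        h = sum((a - b) * (p - q) for a, b, p, q in zip(x1, y1, x2, y2))
        n += 1
        t += h          # T_n
        v_hat += h * h  # empirical variance process: sum of h_i^2
        if t > threshold(v_hat, alpha, c):
            return n    # reject H_0 and stop early
```

Each step costs one $O(d)$ inner product and updates two scalars, so the test never needs to revisit earlier data.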

The above sequential threshold is remarkable, because wrapped into the practically useful and 
simple expression is a deep mathematical result - the uniform Bernstein LIL effectively involves 
a union bound for the error probability over an infinite sequence of times. Any naive attempt to 
union bound the error probabilities for a possibly infinite sequential testing procedure will be too 
loose and hence too conservative - indeed, the classical LIL is known to be asymptotically tight 
including constants, and our non-asymptotic LIL is also tight up to small constant factors. 

This type-I error control with an implicit infinite union bound surprisingly does not lead to 
a loss in power. Indeed, our statistic possesses essentially the same power as the corresponding 
linear-time batch two sample test, and also stops early for easy problems. We make this precise in 
the following two subsections. 
3.2 A Linear-Time Batch Test 

Here we study a simple linear-time batch two-sample mean test, following the template in Fig. 1. 
Consider the linear-time statistic $T_N = \sum_{i=1}^{N} h_i$, where, as before, $h_i = (x_{2i-1} - y_{2i-1})^\top (x_{2i} - y_{2i})$. 
Note that the $h_i$'s are also i.i.d., and $T_N$ relies on $2N$ data points from each distribution. 

Let $V_{N0}, V_{N1}$ be $\mathrm{var}(T_N) = N\,\mathrm{var}(h)$ under $H_0, H_1$ respectively. Recalling Proposition 1, 

$$V_{N0} := N V_0 := 4N\,\mathrm{tr}(\Sigma^2),$$

$$V_{N1} := N V_1 := N(4\,\mathrm{tr}(\Sigma^2) + 4\delta^\top \Sigma \delta).$$


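Then since $T_N$ is a sum of i.i.d. variables, the central limit theorem (CLT) implies that (where $\xrightarrow{d}$ 
is convergence in distribution) 

$$\frac{T_N}{\sqrt{V_{N0}}} \xrightarrow{d}_{H_0} \mathcal{N}(0, 1) \tag{8a}$$

$$\frac{T_N - N\|\delta\|^2}{\sqrt{V_{N1}}} \xrightarrow{d}_{H_1} \mathcal{N}(0, 1) \tag{8b}$$

Based on this information, our test rejects the null hypothesis whenever 

$$T_N > \sqrt{V_{N0}}\, z_\alpha, \tag{9}$$

where $z_\alpha$ is the $1 - \alpha$ quantile of the standard normal distribution. So Eq. (8a) ensures that 

$$P_{H_0}\left(\frac{T_N}{\sqrt{V_{N0}}} > z_\alpha\right) \le \alpha,$$

giving us type I error control under $H_0$. 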

In practice, we may not know $V_{N0}$, so we standardize the statistic using the empirical variance; 
since we assume $N$ is large, these scalar variance estimates do not change the effective power 
analysis. For non-asymptotic type I error control, we can use an empirical Bernstein inequality [18, 
Thm 11], based on an unbiased estimator of $V_N$. Specifically, the empirical variance of the $h_i$'s ($\hat V_N$) 
can be used to reject the null whenever 

$$T_N > \sqrt{2 \hat V_N \ln(2/\alpha)} + \frac{7N \ln(2/\alpha)}{3(N - 1)}. \tag{10}$$

Ignoring constants for clarity, the empirical Bernstein inequality effectively suggests that the batch 
test from Figure 1 will have type I error control of $\alpha$ on setting threshold 

$$p_N \propto \ln\left(\frac{1}{\alpha}\right) + \sqrt{2 \hat V_N \ln\left(\frac{1}{\alpha}\right)} \tag{11}$$


For immediate comparison, we copy below the expression for $q_n$ from Eq. (7): 

$$q_n \propto \ln\left(\frac{1}{\alpha}\right) + \sqrt{2 \hat V_n \ln\left(\frac{\ln \hat V_n}{\alpha}\right)}$$

This similarity explains the optimal power and stopping time properties, detailed in the next 
subsection. 

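One might argue that if $N$ is large, then $\hat V_N \approx V_N$, and in this case we can simply derive the 
(asymptotic) power of the batch test given in Eq. (9) as 

$$P_{H_1}\left(\frac{T_N}{\sqrt{V_{N0}}} > z_\alpha\right) = P_{H_1}\left(\frac{T_N - N\|\delta\|^2}{\sqrt{V_{N1}}} > z_\alpha \sqrt{\frac{V_{N0}}{V_{N1}}} - \frac{N\|\delta\|^2}{\sqrt{V_{N1}}}\right) = \Phi\left(\frac{\sqrt{N}\|\delta\|^2}{\sqrt{8\,\mathrm{tr}(\Sigma^2) + 8\delta^\top \Sigma \delta}} - z_\alpha \sqrt{\frac{\mathrm{tr}(\Sigma^2)}{\mathrm{tr}(\Sigma^2) + \delta^\top \Sigma \delta}}\right) \tag{12}$$

Note that the second term is a constant less than $z_\alpha$. As a concrete example, when $\Sigma = \sigma^2 I$, and 
we denote the signal-to-noise ratio as $\Psi := \frac{\|\delta\|}{\sigma}$, then the power of the linear-time batch test is at 
least $\Phi\left(\frac{\sqrt{N}\Psi^2}{\sqrt{8d + 8\Psi^2}} - z_\alpha\right)$. 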
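
For the isotropic example, the lower bound on the batch power is a one-line computation. The sketch below uses only the Python standard library; the function name and argument names are ours.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def batch_power_lower_bound(n: int, d: int, snr: float, alpha: float = 0.05) -> float:
    """Phi( sqrt(N) Psi^2 / sqrt(8 d + 8 Psi^2) - z_alpha ) for Sigma = sigma^2 I."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)  # the (1 - alpha) quantile
    shift = math.sqrt(n) * snr ** 2 / math.sqrt(8.0 * d + 8.0 * snr ** 2)
    return _STD_NORMAL.cdf(shift - z_alpha)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, round(batch_power_lower_bound(n, d=10, snr=1.0), 3))
```

With no signal ($\Psi = 0$) the expression collapses to $\Phi(-z_\alpha) = \alpha$, and for fixed $\Psi > 0$ it increases to 1 as $N$ grows.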


3.3 Power and Stopping Time of Sequential Test 


The striking similarity of Eq. (11) and Eq. (7), mentioned in the previous subsection, is not 
coincidental. Indeed, both of these arise out of non-asymptotic versions of CLT-like control and LIL-like 
control, and we know that in the asymptotic regime for Bernoulli coin-flips, CLT thresholds and 
LIL thresholds differ by just $\propto \sqrt{\ln\ln n}$ factors. Hence, it is not surprising to see the empirical 
Bernstein LIL match empirical Bernstein thresholds up to $\propto \sqrt{\ln\ln \hat V_n}$ factors. Since the power of the 
sequential test is at least the probability of rejection at the very last step, and since $\sqrt{\ln\ln n} < 2$ 
even for $n = 10^{20}$, the power of the linear-time sequential and batch tests is essentially the same. 
However, a sequential test that rejects at the last step is of little practical interest, bringing us to 
the issue of early stopping. 

Early Stopping. The argument is again identical to that of Section 2, proving that $\mathbb{E}_{H_1}[\tau]$ is nearly 
optimal, and arbitrarily close to optimal as $\beta$ tends to zero. Once more note that the “optimal” 
above refers to the performance of the oracle linear-time batch algorithm that was informed about 
the right number of points to subsample and use for the one-time batch test. Formally, let $n^*_\beta(\delta)$ 
denote this minimum sample size for the two-sample mean testing batch problem to achieve a power 
$1 - \beta$, the $*$ indicating that this is an oracle value, unknown to the user of the batch test. From Eq. (12), 
it is clear that for $N \ge \frac{8\,\mathrm{tr}(\Sigma^2) + 8\delta^\top \Sigma \delta}{\|\delta\|^4}(z_\beta + z_\alpha)^2$, the power becomes at least $1 - \beta$. In other words, 

$$n^*_\beta(\delta) \le \frac{8\,\mathrm{tr}(\Sigma^2) + 8\delta^\top \Sigma \delta}{\|\delta\|^4}(z_\beta + z_\alpha)^2 \tag{13}$$

Theorem 6. Under $H_1$, the sequential algorithm of Fig. 1 using $q_n$ from Eq. (7) has expected 
stopping time $\propto n^*_\beta(\delta)$. 

For clarity, we simplify (7) and (11) by dropping the initial $\ln\left(\frac{1}{\alpha}\right)$ additive term since it is soon 
dominated by the second term and does not qualitatively affect the conclusion. 
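
The oracle bound of Eq. (13) is simple to evaluate. In the sketch below (names are ours), we assume $z_\beta$ denotes the $1 - \beta$ quantile of the standard normal, by analogy with $z_\alpha$; the inputs are the scalars $\mathrm{tr}(\Sigma^2)$, $\delta^\top \Sigma \delta$ and $\|\delta\|$.

```python
import math
from statistics import NormalDist


def oracle_sample_size_bound(
    tr_sigma_sq: float,
    quad_form: float,
    delta_norm: float,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> int:
    """Upper bound (13): (8 tr(Sigma^2) + 8 delta^T Sigma delta) / ||delta||^4 * (z_beta + z_alpha)^2."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha)
    z_beta = nd.inv_cdf(1.0 - beta)  # assumed convention: the (1 - beta) quantile
    bound = (8.0 * tr_sigma_sq + 8.0 * quad_form) / delta_norm ** 4 * (z_beta + z_alpha) ** 2
    return math.ceil(bound)


if __name__ == "__main__":
    # Sigma = I_10, so tr(Sigma^2) = 10 and delta^T Sigma delta = ||delta||^2.
    for delta in (1.0, 0.5, 0.25):
        print(delta, oracle_sample_size_bound(10.0, delta ** 2, delta))
```

When the trace term dominates, halving $\|\delta\|$ multiplies the bound by about 16, which is the $1/\|\delta\|^4$ scaling that the stopping time is expected to track.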
3.4 Discussion 


This section’s arguments have given an illustration of the flexibility and great generality of the 
ideas we used to test the bias of the coin. In the two-sample setting, we just design the statistic 
$T_N = \sum_{i=1}^{N} h_i$ to be a mean-zero random walk under the null. As in the coin’s case, the LIL controls 
type I error, and the rest of the arguments are identical because of the common concentration 
properties of all random walks. 

Our test statistic $T_N$ is chosen with several considerations in mind. First, the batch test is 
linear-time in the sample complexity, so we are comparing algorithms with the same computational 
budget, on a fair footing. There exist batch tests using U-statistics that have higher power than 
ours ([21]) for a given $N$, but they use more computational resources ($O(N^2)$ rather than $O(N)$). 

Also, the batch statistic is a sum of random increments, a common way to write many hypothesis 
tests, and one that can be computed on the fly in the sequential setting. Note that $T_N$ is a scalar, 
so our arguments do not change with $d$, and we inherit the favorable high-dimensional statistical 
performance of the statistic; [21] has more relevant discussion. The statistic also has been shown 
to have powerful generalizations in the recent statistics literature, which we discuss in Section 5. 

Though we assume data scaled to have norm at most $\frac{1}{2}$ for convenience, this can be loosened. Any 
data with bounded norm $B > \frac{1}{2}$ can be rescaled by a factor $\frac{1}{2B}$ just for the analysis, and then 
our results can be used. This results in an empirical Bernstein bound like Thm. 5 but of order 
$O\left(C_0(\xi) + \sqrt{\hat V_n \ln\left(\frac{\ln(B \hat V_n)}{\xi}\right)}\right)$. The dependence on $B$ is very weak, and is negligible even when 
$B = \mathrm{poly}(d)$. 

In fact, we only require control of the higher moments (e.g. by Bernstein conditions, which 
generalize boundedness and sub-Gaussianity conditions) to prove the non-asymptotic Bernstein 
LIL in [2], exactly as is the case for the usual Bernstein concentration inequalities for averages ([3]). 
Therefore, our basic arguments hold for unbounded increments hi as well. In fact, the LIL itself, as 
well as the non-asymptotic LIL bounds of [2], apply to martingales - much more general versions 
of random walks capable of modeling dependence on the past history. Our ideas could conceivably 
be extended to this setting to devise more data-dependent tests, which would be interesting future 
work. 


4 Empirical Evaluation 


In this section, we evaluate our proposed sequential test on synthetic data, to validate the 
predictions made by our theory concerning its type I/II errors and the stopping time. 

We simulate data from two multivariate Gaussians ($d = 10$), motivated by our discussion at 
the end of Section 3.2: each Gaussian has covariance matrix $\Sigma = \sigma^2 I_d$, one has mean $\mu_1 = 0_d$ and 
the other has $\mu_2 = (\delta, 0, 0, \ldots, 0) \in \mathbb{R}^d$ for some $\delta > 0$. We keep $\sigma = 1$ here to keep the scale of 
the data roughly consistent with the biased-coin example, though we find the scaling of the data 
makes no practical difference, as we discussed. 


4.1 Running the Test and Type I Error 

Like typical hypothesis tests, ours is designed to control type I error. When implementing our 
algorithmic ideas, it suffices to set $q_n$ as in (7), where the only unknown parameter is the 
proportionality constant $C$. The theory suggests that this is an absolute constant, and prescribes an 
upper bound for it, which can conceivably be loose because of the analytic techniques used (as [2] 
discusses). On the other hand, in the asymptotic limit the bound becomes tight; the empirical 
$\hat V_n$ converges quickly to its mean $V_n$, and we know from second-moment versions of the LIL that 
$C = \sqrt{2}$ and $C_0 = 0$ suffice. However, as we consider smaller finite times, that bound must relax 
(at the extremely low $t = 1$ or $2$ when flipping a fair coin, for instance). 

Nevertheless, we find that in practice, for even moderate sample sizes like the ones we test 
here, the same reasonable constants suffice in all our experiments: $C = \sqrt{2}$ and $C_0 = \log(\frac{1}{\alpha})$, with 
$C_0$ following Thm. 5 and similar fixed-sample Bennett bounds ([3]; also see Appendix D). The 
situation is exactly analogous to how the Gaussian approximation is valid for even moderate sample 
sizes in batch testing, making possible a huge variety of common tests that are asymptotically and 
empirically correct with reasonable constants to boot. 

To be more specific, consider the null hypothesis for the example of the coin bias testing given 
earlier; these fair coin flips are the most anti-concentrated possible bounded steps, and render our 
empirical Bernstein machinery ineffective, so they make a good test case. We choose $C$ and $C_0$ 
as above, and plot the cumulative probability of type I violations $\Pr_{H_0}(\tau \le n)$ up to time $n$ for 
different $\alpha$ (where $\tau$ is the stopping time of the test), with the results in Fig. 2. To control type I 
error, the curves need to be asymptotically upper-bounded by the desired $\alpha$ levels (dotted lines). 
This does not appear true for our recommended settings of $C$, $C_0$, but the figure still indicates that 
type I error is controlled even for very high $n$ with our settings. A slight further raise in $C$ beyond 
$\sqrt{2}$ suffices to guarantee much stronger control (Appendix G). 

Fig. 2 also seems to contain linear plots, which we cannot fully explain. We conjecture it is 
related to the standard proof of the classical LIL, which divides time into epochs of exponentially 
growing size. 

For more on provable correctness with low $C$, see Appendix D, or Appendix G for more empirical 
discussion. 



Figure 2: $\Pr_{H_0}(\tau \le n)$ against $\ln(n)$ for different $\alpha$, on biased coin. Dotted lines of corresponding colors are the 
target levels $\alpha$. 


4.2 Type II Error and Stopping Time 

Now we verify the results at the heart of the paper: uniformity over alternatives $\delta$ of the type II 
error and stopping time properties. 

Figure 3: Power vs. $\ln(N)$ for different $\delta$, on Gaussians. Dashed lines represent power of batch test 
with $N$ samples. 


Fig. 3 plots the power of the sequential test $P_{H_1(\delta)}(\tau < N)$ against the maximum runtime $N$ 
using the Gaussian data, at a range of different alternatives $\delta$; the dashed and solid lines represent 
the power of the batch test (11) with $N$ samples, and the sequential test with maximum runtime 
$N$, respectively. As we might expect, the batch test has somewhat higher power for a given sample size, but 
the sequential test consistently performs well compared to it. The role of $N$ here is basically to 
set a desired tolerance for error; increasing $N$ does not change the intermediate updates of the 
algorithm, but does increase the power by potentially running the test for longer. So each curve 
in Fig. 3 transparently illustrates the statistical tradeoff inherent in hypothesis testing against a 
fixed simple alternative, but the great advantage of our sequential test is in achieving all of them 
simultaneously with the same algorithm. 

To highlight this point, we examine the stopping time compared to the batch test for the 
Gaussian data, in Fig. 4. We see that the distributions of $\ln(\tau)$ are all quite concentrated, and 
that their medians (marked) fit well to a slope-4 line, showing the predicted $\frac{1}{\delta^4}$ dependence on $\delta$. 
Some more experiments are in Appendix G.1. 

Figure 4: Distribution of $\log_{1.25}(\tau)$ for $\delta \in \{0.5(1.25)^c : c \in \{7, 6, \ldots, 0\}\}$, so that the abscissa 
values $\{\log_{1.25}(\frac{1}{\delta})\}$ are a unit length apart. Dashed line has slope 4. 
5 Further Extensions 


A General Two-Sample Test. Given two independent multivariate streams of i.i.d. data, instead 
of testing for differences in mean, we could also test for differences in any moment, i.e. differences 
in distribution, a subtler problem which may require much more data to ascertain differences in 
higher moments. In other words, we would be testing 

$$H_0 : P = Q \quad \text{versus} \quad H_1 : P \neq Q$$

One simple way to do this is by using a kernel two-sample test, like the Maximum Mean
Discrepancy (MMD) test proposed by [10]. The population MMD is defined as

$$\mathrm{MMD}(P, Q) = \sup_{f \in \mathcal{H}_k} \left( \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right)$$

where $\mathcal{H}_k$ is the unit ball of functions in the Reproducing Kernel Hilbert Space corresponding to
some positive semidefinite Mercer kernel $k$. One common choice is the Gaussian kernel
$k(a,b) = \exp(-\|a - b\|^2 / \gamma^2)$. With this choice, the population MMD has an interesting interpretation, given
by Bochner's theorem [23] as

$$\mathrm{MMD} = \int_{\mathbb{R}^d} |\varphi_X(t) - \varphi_Y(t)|^2 \, e^{-\gamma^2 \|t\|^2} \, dt$$

where $\varphi_X(t), \varphi_Y(t)$ are the characteristic functions of $P, Q$. This means that the population MMD
is nonzero iff the distributions differ (i.e. the alternative holds).

The authors of [10] propose the following (linear-time) batch test statistic after seeing $2N$
samples: $\mathrm{MMD}_N = \frac{1}{N} \sum_{i=1}^{N} h_i$, where
$h_i = k(x_{2i}, x_{2i+1}) + k(y_{2i}, y_{2i+1}) - k(x_{2i}, y_{2i+1}) - k(x_{2i+1}, y_{2i})$.

The associated test is consistent against all fixed (and some local) alternatives where $P \neq Q$;
see [10] for a proof, and [21] for a high-dimensional analysis of this test (in the limited setting of
mean-testing that we consider earlier in this paper). Both properties are inherited by the following
sequential test.

The sequential statistic we construct after seeing $n$ batches ($2n$ samples) is the random walk
$T_n = \sum_{i=1}^{n} h_i$, which has mean zero under the null because $\mathbb{E}[\mathrm{MMD}_N] = \mathbb{E}[h_i] = 0$. The
similarity with our mean-testing statistic is not coincidental; when $k(a,b) = a^\top b$, they coincide,
further motivating our choice of test statistic $U_n$ earlier in the paper. As before, we use the LIL
to get type I error control, nearly the same power as the linear-time batch test, and also early
stopping much before seeing $N$ points if the problem at hand is easy.
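
The random walk described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names (`gaussian_kernel`, `mmd_increment`, `sequential_mmd`) are hypothetical, and the rejection boundary is passed in as a caller-supplied `threshold(n)` because the LIL-based threshold $q_n$ is defined elsewhere in the paper. It processes one fresh pair of points from each stream per step, so it uses constant memory and linear time.

```python
import math


def gaussian_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-||a - b||^2 / gamma^2) for equal-length sequences."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / gamma ** 2)


def mmd_increment(x0, x1, y0, y1, kernel=gaussian_kernel):
    """One increment h_i, built from a fresh pair of points from each stream."""
    return kernel(x0, x1) + kernel(y0, y1) - kernel(x0, y1) - kernel(x1, y0)


def sequential_mmd(xs, ys, threshold, kernel=gaussian_kernel):
    """Run T_n = h_1 + ... + h_n and stop at the first n with T_n > threshold(n).

    `threshold` is a hypothetical stand-in for the LIL-based boundary q_n.
    Returns (rejected, number_of_batches_used, T_n).
    """
    xs, ys = iter(xs), iter(ys)
    t, n = 0.0, 0
    while True:
        try:
            x0, x1, y0, y1 = next(xs), next(xs), next(ys), next(ys)
        except StopIteration:
            # Streams exhausted without crossing the boundary: do not reject.
            return False, n, t
        n += 1
        t += mmd_increment(x0, x1, y0, y1, kernel)
        if t > threshold(n):
            return True, n, t
```

With the linear kernel $k(a,b) = a^\top b$, each increment reduces to $(x_{2i} - y_{2i})^\top (x_{2i+1} - y_{2i+1})$, which is the coincidence with the mean-testing statistic noted in the text.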

A General Independence Test. Given a single multivariate stream of i.i.d. data, where each
datapoint is a pair $(X_i, Y_i) \in \mathbb{R}^{p+q}$, the independence testing problem involves testing whether $X$
is independent of $Y$ or not. More formally, we want to test

$$H_0 : X \perp Y \quad \text{versus} \quad H_1 : X \not\perp Y. \qquad (14)$$

A test of linear correlation/covariance only detects linear dependence. As an alternative to this,
[26] proposed a population quantity called distance covariance, given by

$$\mathrm{dCov}(X, Y) = \mathbb{E}\|X - X'\|\|Y - Y'\| + \mathbb{E}\|X - X'\| \, \mathbb{E}\|Y - Y'\| - 2\,\mathbb{E}\|X - X'\|\|Y - Y''\|$$

where $(X, Y), (X', Y'), (X'', Y'')$ are i.i.d. pairs from the joint distribution on $(X, Y)$. Remarkably,
an alternative representation is

$$\mathrm{dCov}(X, Y) = \int_{\mathbb{R}^{p+q}} |\phi_{X,Y}(t, s) - \phi_X(t)\phi_Y(s)|^2 \, w(t, s) \, dt \, ds$$

where $\phi_X, \phi_Y, \phi_{X,Y}$ are the characteristic functions of the marginals and joint distribution of $X, Y$
and $w(t,s) \propto 1/(\|t\|_p^{1+p} \|s\|_q^{1+q})$. Using this, the paper [26] concludes that $\mathrm{dCov}(X, Y) = 0$ iff $X \perp Y$.
One way to form a linear-time statistic to estimate dCov is to process the data in batches of size
four, i.e. $B_i = \bigcup_{j=0}^{3} (X_{4i+j}, Y_{4i+j})$, and calculate the scalar


$$h_i = \frac{1}{\binom{4}{2}} \sum \|X - X'\|\|Y'' - Y'''\| + \frac{1}{\binom{4}{2}} \sum \|X - X'\|\|Y - Y'\| - \frac{1}{4 \times 3} \sum_{24} \|X - X'\|\|Y - Y''\|$$

where the summations are over all possible ways of assigning $(X, Y) \neq (X', Y') \neq (X'', Y'') \neq (X''', Y''')$,
each pair being one from $B_i$. The expectation of this quantity is exactly dCov, and the
batch test statistic, given $4N$ datapoints, is simply $\mathrm{dCov}_N = \frac{1}{N} \sum_{i=1}^{N} h_i$. As before, the associated
test is consistent for any fixed alternatives where $X \not\perp Y$. Noting that $\mathbb{E}[\mathrm{dCov}_N] = \mathbb{E}[h_i] = 0$
under the null, our random walk after seeing $n$ batches (i.e. $4n$ points) will just be $T_n = \sum_{i=1}^{n} h_i$.
As in previous sections, the LIL results from [2] can be used to get type I error control, and early
stopping much before seeing $N$ points, if the problem at hand is statistically easy.
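
To make the batch-of-four computation concrete, here is a minimal sketch of one increment $h_i$. It is an illustration under stated assumptions rather than the authors' code: the name `dcov_increment` is hypothetical, and instead of tracking separate normalizers it averages each of the three products over all 24 orderings of the four pairs and then combines them with the weights $(+1, +1, -2)$ of the population definition, so its expectation matches dCov term by term.

```python
import math
from itertools import permutations


def _dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def dcov_increment(batch):
    """h_i for one batch of four (x, y) pairs.

    Each product is averaged over all 24 orderings (i, j, k, l) of the batch:
      joint: ||x_i - x_j|| * ||y_i - y_j||   (shares both indices)
      split: ||x_i - x_j|| * ||y_k - y_l||   (disjoint indices)
      cross: ||x_i - x_j|| * ||y_i - y_k||   (shares one index)
    """
    if len(batch) != 4:
        raise ValueError("expected a batch of exactly four (x, y) pairs")
    xs = [pair[0] for pair in batch]
    ys = [pair[1] for pair in batch]
    joint = split = cross = 0.0
    count = 0
    for i, j, k, l in permutations(range(4)):
        dx = _dist(xs[i], xs[j])
        joint += dx * _dist(ys[i], ys[j])
        split += dx * _dist(ys[k], ys[l])
        cross += dx * _dist(ys[i], ys[k])
        count += 1
    return (joint + split - 2.0 * cross) / count
```

Summing these increments over successive batches gives the random walk $T_n$; a boundary check like the one used for the other tests would then be applied after each batch.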


6 Related Work 

Parametric or asymptotic methods. Our statements about the control of type I/II errors and 
stopping times are very general, following up on early sequential analysis work. Most sequential 
tests operate in the Wald’s framework expounded in m- In a seminal line of work, Robbins and 
colleagues delved into sequential hypothesis testing in an asymptotic sense [23]. Apart from being 
asymptotic, their tests were most often for simple hypotheses (point nulls and alternatives), were 
univariate, or parametric (assuming Gaussianity or known density). That said, two of their most 
relevant papers are [22] and [0], which discuss statistical methods related to the LIL. They give an 
asymptotic version of the argument of Section [2] using it to design sequential Kolmogorov-Smirnov 
tests with power one. Other classic works that mention using the LIL for testing various simple or 
univariate or parametric problems include ummm- These all operate in the asymptotic limit 
in which the classic LIL can be used to set $q_n$.

For testing a simple null against a simple alternative, Wald’s sequential probability ratio test 
(SPRT) was proved to be optimal by the seminal work [29], but this applies when both the null 
and alternative have a known parametric form. The same authors also suggested a univariate 
nonparametric two-sample test in [28], but presumably did not find it clear how to combine these 
two lines of work. 

Bernstein-based methods. Finite-time uniform LIL-type concentration tools from [2] are crucial 
to our analysis, and we adapt them in new ways; but novelty in this respect is not our primary 
focus here, because less recent concentration bounds can also be used to yield similar results. It 
is always possible to use a weighted union bound (allocating failure probability $\xi$ over time as
$\xi_n \propto \xi/n^2$) over fixed-$n$ Bernstein bounds, resulting in a deviation bound of
$O\big(\sqrt{V_n \ln(n/\xi)}\big)$. A more advanced “peeling” argument, dividing time $n$ into exponentially
growing epochs, improves the bound to $O\big(\sqrt{V_n \ln(\ln n / \xi)}\big)$ (e.g. in [12]). This suffices in many
simple situations, but in general is still arbitrarily inferior to our bound of
$O\big(\sqrt{V_n \ln(\ln V_n / \xi)}\big)$, precisely in the case $V_n \ll n$ in which
we expect the second-moment Bernstein bounds to be most useful over Hoeffding bounds. A yet
more intricate peeling argument, demarcating the epochs by exponential intervals in $V_n$ rather
than $n$, can be used to achieve our iterated-logarithm rate, in conjunction with the well-known
second-order uniform martingale bound due to Freedman ([9]). This serves as a sanity check on
the non-asymptotic LIL bounds of [2], where it is also shown that these bounds have the best
possible dependence on all parameters. However, it can be verified that even a suboptimal uniform 


concentration rate like O 




would suffice for the optimal stopping time properties of the 


sequential test to hold, with only a slight weakening of the power. 

Bernstein inequalities that only depend on empirical variance have been used for stopping 
algorithms in Hoeffding races m and other even more general contexts |19j . This line of work 
uses the empirical bounds very similarly to us, albeit in the nominally different context of direct 
estimation of a mean. As such, they too require uniform concentration over time, but achieve it with
a crude union bound (failure probability $\propto \xi/n^2$), resulting in a deviation bound of $O\big(\sqrt{\hat{V}_n \ln(n/\xi)}\big)$.
Applying the more advanced techniques above, it may be possible to get our optimal concentration 
rate, but to our knowledge ours is the first work to derive and use uniform LIL-type empirical 
Bernstein bounds. 


In practice. To our knowledge, implementing sequential testing in practice has previously
invariably relied upon CLT-type results patched together with heuristic adjustments of the CLT
threshold (e.g. the well-known Haybittle-Peto scheme for clinical trials m has an arbitrary con¬ 
servative choice of q n = 0.001 through the sequential process and q n = 0.05 = a at the last 
datapoint). These perform as loose functional versions of our uniform finite-sample LIL upper 
bound, though without theoretical guarantees. In general, it is unsound to use an asymptotically 
normal distribution under the null at stopping time $\tau$: the central limit theorem (CLT) applies to
any fixed time $t$, but it may not apply to a random stopping time $\tau$ (see Anscombe's random-sum
CLT [I] HTj and related work). This has caused myriad practical complications in implementing 
such tests (see HE Section 4). One of our contributions is to rigorously derive a directly usable 
finite-sample sequential test, in a way we believe can be generically extended. 

We emphasize that there are several advantages to our proposed framework and analysis which, 
taken together, are unique in the literature. We tackle the multivariate nonparametric (possibly 
even high-dimensional) setting, with composite hypotheses. Moreover, we not only prove that the 
power is asymptotically one, but also derive finite-sample rates that illuminate dependence of other
parameters on $\beta$, by considering non-asymptotic uniform concentration over finite times. The fact
that it is not provable via purely asymptotic arguments is why our optimal stopping property has
gone unobserved for a wide range of tests, even as basic as the biased coin. In our more refined
analysis, it can be verified (Thm. [2]) that the stopping time diverges to $\infty$ when the required type
II error $\to 0$, i.e. power $\to 1$.
7 Conclusion


We have presented a sequential scheme for multivariate nonparametric hypothesis testing against
composite alternatives, which comes with a full finite-sample analysis in terms of on-the-fly
estimable quantities. Its desirable properties include type I error control by considering finite-time
LIL concentration; near-optimal type II error compared to linear-time batch tests, due to the
iterated-logarithm term in the LIL; and most importantly, essentially optimal early stopping,
uniformly over a large class of alternatives. We presented some simple applications in learning and
statistics, but our design and analysis techniques are general, and their extensions to other settings
are of continuing future interest.
A Proof of Theorem [2]


Proof. Write $K$ as a placeholder absolute constant in the sense of Sec. [2]. Then for any sufficiently
high $n$, our definitions for $q_n$ and $p_n$ tell us that

$$P_{H_1}(\tau \ge n) = P_{H_1}(\forall t \le n : S_t \le q_t) \le P_{H_1}(S_n \le q_n)$$
$$= P_{H_1}(S_n - n\delta \le q_n - n\delta)$$
$$\le P_{H_1}(S_n - n\delta \le -Kn\delta) \qquad (15)$$
$$\le \beta \qquad (16)$$


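for $n \ge n^*$, from (5) and the definition of $n^*$. Also, using a Hoeffding bound on (15), we see that
$P_{H_1}(\tau \ge n) \le e^{-Kn\delta^2}$. So for any $\delta$ and $\beta$,

$$\mathbb{E}_{H_1}[\tau] = \sum_{n=1}^{\infty} P_{H_1}(\tau \ge n) \le n^* + \sum_{n=n^*}^{\infty} e^{-Kn\delta^2} \le n^* + \frac{\beta^K}{1 - e^{-K\delta^2}} \qquad (17)$$
$$\le n^* + \beta^K \left( \frac{1}{K\delta^2} + 1 \right) \le \left( 1 + \frac{K\beta^K}{\ln(1/\beta)} \right) n^* \qquad (18)$$

Here (17) sums the infinite geometric series with initial term $e^{-Kn^*\delta^2} \le \beta^K$, and (18) uses the
inequality $\frac{1}{1 - e^{-x}} \le \frac{1}{x} + 1$ as well as $n^*_\beta(\delta) \ge \frac{K \ln(1/\beta)}{\delta^2}$.

□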


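B Proof of Proposition [1]

Proof of Proposition [1]. Since $x, x', y, y'$ are all independent, $\mathbb{E}[h] = (\mathbb{E}[x] - \mathbb{E}[y])^\top (\mathbb{E}[x'] - \mathbb{E}[y']) = \delta^\top \delta = \|\delta\|^2$.
Next,

$$\mathbb{E}[h^2] = \mathbb{E}\left[ \left( (x - y)^\top (x' - y') \right)^2 \right]$$
$$= \mathbb{E}\left[ (x - y)^\top (x' - y')(x' - y')^\top (x - y) \right]$$
$$= \mathbb{E}\left[ \mathrm{tr}\left( (x - y)(x - y)^\top (x' - y')(x' - y')^\top \right) \right]$$
$$= \mathrm{tr}\left( \mathbb{E}\left[ (x - y)(x - y)^\top \right] \mathbb{E}\left[ (x' - y')(x' - y')^\top \right] \right)$$

Since $\mathbb{E}[(x - y)(x - y)^\top] = \Sigma_1 + \Sigma_2 + \delta\delta^\top = 2\Sigma + \delta\delta^\top$, we have

$$\mathrm{var}(h) = \mathbb{E}[h^2] - (\mathbb{E} h)^2 = \mathrm{tr}[(2\Sigma + \delta\delta^\top)^2] - \|\delta\|^4$$
$$= 4\,\mathrm{tr}(\Sigma^2) + 4\,\delta^\top \Sigma \delta$$

from which the result is immediate. □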
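
The last algebraic step of this proof, $\mathrm{tr}[(2\Sigma + \delta\delta^\top)^2] - \|\delta\|^4 = 4\,\mathrm{tr}(\Sigma^2) + 4\,\delta^\top\Sigma\delta$, is easy to check numerically. The sketch below is only a sanity check under the proof's assumption that both samples share a symmetric covariance $\Sigma$; the helper names (`matmul`, `trace`, `var_h_from_second_moment`, `var_h_closed_form`) are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def var_h_from_second_moment(sigma, delta):
    """tr[(2*Sigma + delta delta^T)^2] - ||delta||^4."""
    d = len(delta)
    m = [[2.0 * sigma[i][j] + delta[i] * delta[j] for j in range(d)]
         for i in range(d)]
    norm_fourth = sum(v * v for v in delta) ** 2
    return trace(matmul(m, m)) - norm_fourth


def var_h_closed_form(sigma, delta):
    """4 tr(Sigma^2) + 4 delta^T Sigma delta."""
    d = len(delta)
    quad = sum(delta[i] * sigma[i][j] * delta[j]
               for i in range(d) for j in range(d))
    return 4.0 * trace(matmul(sigma, sigma)) + 4.0 * quad
```

For example, with $\Sigma = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 1 \end{pmatrix}$ and $\delta = (1, -2)$, both functions evaluate to 38.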
C Proof of Theorem [3]


We rely upon a variance-dependent form of the LIL. Upon noting that $\mathbb{E}[T_n - T_{n-1}] = 0$ and
$\mathbb{E}[(T_n - T_{n-1})^2] = V_0$, it is an instance of a general martingale concentration inequality from [2].


Theorem 7 (Uniform Bernstein Bound (Instantiation of [2], Theorem 4)). Suppose \T n — T n -i 


w.p. 1 for all n > 1. Fix any £ < 1 and define tq(£) = min < s : sVq > 


y+v^y 

e—2 


In 


with probability > 1 — £, for all n> tq simultaneously, \T n \ < 2 l e jL. -tVo and 

(i+0 31 


I < 1 

Then 


< ^ 1 6(e — 2)tVo ^2 In In ^ 


3(e — 2)e 2 fVo\ 
\T n \ ) 


+ hr ( % 


In principle this tight control by the second moment is enough to achieve our goals, just as the 
second-moment Bernstein inequality for random variables suffices for proving empirical Bernstein 
inequalities. 

However, the version we use for our empirical Bernstein bound is a more convenient though
looser restatement of Theorem [7]. To derive it, we refer to the appendices of [2] for the following
result:

Lemma 8 ([2], Theorem 16). Take any £ > 0, and define T n and to(£) as in Theorem [?| With 
probability > 1 — for all n < to(£) simultaneously, 

IT,| < 2 (l + 073) In 

Theorem [3] follows by loosely combining the above two uniform bounds. 

Proof of Theorem [3| Recall V n := n\ 0 . Theorem [?] gives that w.p. 1 — for all n > , 

\T n \ < / Vn and 

( 1 + 073 ) 


|T n | < max ( 3(e — 2)e 2 , ^ / 2C\ V n In In V n + C\ V n In ( - 


(19) 


Taking a union bound of (19) with Lemma [8] gives that w.p. $\ge 1 - \xi$, the following is true for all
$n$ simultaneously:


IT I < 


2 1 + 


ln 


V n and max ^3(e - 2)e 2 , 2C\V n In In V n + C\V n In (| 


if t < r 0 (f/2) 
if n > r 0 (T 2 ) 


For all n we have |T n | bounded by the maximum of the two cases above. The result can be seen to 
follow, by relaxing the explicit bound \T n \ < y n instead transform lnln into [lnln] + . □ 


i+a/1/3 
D Proportionality Constants and Guaranteed Correctness


After observing the first few samples, regardless of how many, it is impossible to empirically conclude
with certainty that the type I error of a sequential test (Fig. [2]) has ever leveled off. And although
our theory can guarantee type I error control, it is reasonable to question whether our empirically
recommended prescription $C = \sqrt{2}$ is actually sound, even in the hypothetical case $n \to \infty$.

In fact, we can show that it is unsound. Consider first the biased coin example of Sec. [2]. If $S_n$
is the test statistic, then under the null,

$$\sup_{n \ge 1} \frac{S_n}{\sqrt{n \ln\ln n}} \ge \inf_{k \ge 1} \sup_{n \ge k} \frac{S_n}{\sqrt{n \ln\ln n}}
= \lim_{k \to \infty} \sup_{n \ge k} \frac{S_n}{\sqrt{n \ln\ln n}}
= \limsup_{n \to \infty} \frac{S_n}{\sqrt{n \ln\ln n}} = \sqrt{2}$$

w.p. 1, from the asymptotic LIL of Thm. [1].

So the sequential test will almost surely reject with $C = 2$, which is very undesirable. We still
recommend this for two reasons.

Firstly, it appears not to be an empirical issue, because of Co and because of our finite N 
needed in practice to detect the alternative. As evidence of this, we count type I violations under 
the fair-coin null (the maximally anti-concentrated stochastic process under the random walk) for a 
very large N = 10 6 with C = 3, a = 0.05, Cq = log repeatedly using 10 5 Monte Carlo trials. We 
see an average of 3-5 type-I-violating sample paths (out of $10^5$), that is, almost no type I error, because
$C$ is relatively high.

Secondly, it is possible to set $C > 2$ and get provable type I error control, at the cost of a
somewhat higher stopping time. In the theory, $C$ and $C_0$ can be tightened for sufficiently high
sample sizes ([2]): the reason is that for sufficiently high $n$, the order of growth of the bound is
dominated by $O(\sqrt{n \ln\ln n})$, and all the sources of looseness in the analysis leading to our final
uniform empirical Bernstein inequality (Thm. [5]) can therefore be bounded by increasing the
iterated-logarithm proportionality constant.

These ideas generalize cleanly beyond the biased-coin example to our other tests. Exactly the
same argument as above can be used with our two-sample mean test statistic $T_n$, its variance
process $V_n$, and the variance-based asymptotic LIL ([2,5]) to give, w.p. 1,

$$\sup_{n \ge 1} \frac{T_n}{\sqrt{V_n \ln\ln V_n}} \ge \sqrt{2}$$

So even after taking into account the convergence rate of the empirical variance $\hat{V}_n \to V_n$, our basic
conclusions and recommendations remain the same for all our tests beyond the biased-coin setting.


D.1 Type II Error Approximation

We argue that the power of the sequential test with maximum runtime $N$ is approximately
lower-bounded by the power of the batch test with $N$ examples (e.g. in ([3]) for the coin example). This
argument can be made more exact.
For the coin example, we can work with a more refined approximation via the CLT, when $N$ is
high so that $N\delta > q_N > p_N$. Defining the standardized quantities $p_N^s = \frac{p_N - N\delta}{\sigma_N}$ and
$q_N^s = \frac{q_N - N\delta}{\sigma_N}$, where $\sigma_N$ denotes the standard deviation of $S_N$ under $H_1(\delta)$,

$$P_{H_1(\delta)}(\exists n \le N : S_n > q_n) \ge P_{H_1(\delta)}(S_N > q_N)$$
$$= P_{H_1(\delta)}(S_N > p_N) - P_{H_1(\delta)}(q_N \ge S_N > p_N)$$
$$= P_{H_1(\delta)}(S_N > p_N) - P_{H_1(\delta)}\left( q_N^s \ge \frac{S_N - N\delta}{\sigma_N} > p_N^s \right)$$
$$\approx P_{H_1(\delta)}(S_N > p_N) - \left( \Phi(q_N^s) - \Phi(p_N^s) \right)$$

When there is an abundance of data, the sequential test would typically be run with very large
$N$, since it would typically stop much sooner (see Appendix [G]). So this CLT approximation is in
fact extremely good, and it can be made into a lower bound if necessary with a negligible $N^{-O(1)}$
deviation term. A similar argument can be made for our more complex tests.


E Proof of Lemma [4]

Proof. Here, $\nu_i := h_i^2 - \mathbb{E}[h_i^2]$ has mean zero by definition. It has a cumulative variance process
that is self-bounding:

$$B_n := \sum_{i=1}^{n} \mathbb{E}[\nu_i^2] = \sum_{i=1}^{n} \mathbb{E}\left[ \left( h_i^2 - \mathbb{E}[h_i^2] \right)^2 \right]
= \sum_{i=1}^{n} \left( \mathbb{E}[h_i^4] - (\mathbb{E}[h_i^2])^2 \right) \le \sum_{i=1}^{n} \mathbb{E}[h_i^4]
\overset{(a)}{\le} \sum_{i=1}^{n} \mathbb{E}[h_i^2] = nV_0 := A_n \qquad (20)$$

where the last inequality (a) uses that $|h_i| \le 1$, and we define the process $A_n$ for convenience.
Applying Theorem [3] to the mean-zero random walk $\sum_{i=1}^{n} \nu_i$ gives, w.p. $\ge 1 - \xi$, for all $n$ that:

$$|\hat{V}_n - A_n| = \left| \sum_{i=1}^{n} \left( h_i^2 - \mathbb{E}[h_i^2] \right) \right|
\le C_0(\xi) + \sqrt{2C_1 A_n [\ln\ln]_+(A_n) + C_1 A_n \ln(4/\xi)}$$

This can be relaxed to

$$A_n - \sqrt{2C_1 A_n [\ln\ln]_+(A_n) + C_1 A_n \ln(4/\xi)} - C_0(\xi) - \hat{V}_n \le 0 \qquad (21)$$

Suppose $A_n \ge 108 \ln(4/\xi)$. Then a straightforward case analysis confirms that

$$A_n \ge 8 \max\left( 2C_1 [\ln\ln]_+(A_n), \; C_1 \ln(4/\xi) \right)$$

This is precisely the condition needed to invert (21) using Lemma [9]. Doing this yields that

$$\sqrt{A_n} \le \sqrt{2C_1 [\ln\ln]_+\left( 2C_0(\xi) + 2\hat{V}_n \right) + C_1 \ln(4/\xi)} + \sqrt{C_0(\xi) + \hat{V}_n} \qquad (22)$$

For sufficiently high $\hat{V}_n$ ($O(\ln(4/\xi))$ suffices), the first term on the right-hand side of (22) is bounded
as

$$\sqrt{2C_1 [\ln\ln]_+\left( 2C_0(\xi) + 2\hat{V}_n \right) + C_1 \ln(4/\xi)} \le \sqrt{4C_1 [\ln\ln]_+\left( 2C_0(\xi) + 2\hat{V}_n \right)} \le \sqrt{8C_1 \left( C_0(\xi) + \hat{V}_n \right)}.$$

Resubstituting into (22) and squaring both sides yields the result. It remains to check the case
$A_n < 108 \ln(4/\xi)$. But this bound clearly holds in the statement of the result, so the proof is
finished. □


The following lemma is useful to invert inequalities involving the iterated logarithm.

Lemma 9. Suppose $b_1, b_2, c$ are positive constants, $x \ge 8\max(b_1 [\ln\ln]_+(x), b_2)$, and

$$x - \sqrt{b_1 x [\ln\ln]_+(x) + b_2 x} - c \le 0 \qquad (23)$$

Then

$$\sqrt{x} \le \sqrt{b_1 [\ln\ln]_+(2c) + b_2} + \sqrt{c}.$$

Proof. Suppose $x \ge 8\max(b_1 [\ln\ln]_+(x), b_2)$. Since $x \ge 8b_2$, we have

$$0 \le \frac{x}{8} - b_2 \implies 0 \le \frac{x^2}{8} - b_2 x$$

Substituting the assumption $\frac{x}{8} \ge b_1 [\ln\ln]_+(x)$ gives

$$0 \le \frac{x^2}{4} - b_1 x [\ln\ln]_+(x) - b_2 x \implies \sqrt{b_1 x [\ln\ln]_+(x) + b_2 x} \le \frac{x}{2}$$

Substituting this into (23) gives $x \le 2c$. Therefore, again using (23),

$$0 \ge x - \sqrt{b_1 x [\ln\ln]_+(x) + b_2 x} - c \ge x - \sqrt{b_1 x [\ln\ln]_+(2c) + b_2 x} - c$$

This is now a quadratic in $\sqrt{x}$. Solving it (using $\sqrt{x} \ge 0$) gives

$$\sqrt{x} \le \frac{1}{2} \left( \sqrt{b_1 [\ln\ln]_+(2c) + b_2} + \sqrt{b_1 [\ln\ln]_+(2c) + b_2 + 4c} \right) \le \sqrt{b_1 [\ln\ln]_+(2c) + b_2} + \sqrt{c}$$

using the subadditivity of $\sqrt{\cdot}$. □
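
Lemma 9 can be spot-checked numerically on a grid of values. The sketch below is only a sanity check, not part of the proof. It assumes $[\ln\ln]_+(x) = \max(0, \ln\ln x)$ for $x > 1$ and $0$ otherwise, which may differ from the paper's exact definition; the argument only needs a nonnegative, nondecreasing function. The names `lnln_plus` and `check_lemma9` are hypothetical.

```python
import math


def lnln_plus(x):
    """Assumed definition: ln ln x clipped below at 0 (and 0 for x <= 1)."""
    if x <= 1.0:
        return 0.0
    return max(0.0, math.log(math.log(x)))


def check_lemma9(b1, b2, c, xs, tol=1e-9):
    """Return (points satisfying both premises, points violating the conclusion)."""
    bound = math.sqrt(b1 * lnln_plus(2.0 * c) + b2) + math.sqrt(c)
    checked = violations = 0
    for x in xs:
        large_enough = x >= 8.0 * max(b1 * lnln_plus(x), b2)
        ineq_23 = x - math.sqrt(b1 * x * lnln_plus(x) + b2 * x) - c <= 0.0
        if large_enough and ineq_23:
            checked += 1
            if math.sqrt(x) > bound + tol:
                violations += 1
    return checked, violations
```

Running `check_lemma9(1.0, 1.0, 100.0, [0.5 * i for i in range(1, 4001)])` should report a positive number of checked points and zero violations, consistent with the lemma.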
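F Proof of Theorem [6]


In this proof we use $\prec, \preceq, \succ, \succeq, \asymp$ to denote $<, \le, >, \ge, =$ when ignoring constants. Let us first
bound $P(\tau \ge n)$ for sufficiently large $n$ such that $\hat{V}_n \asymp V_n$ and that $\sqrt{2\log(1/\alpha) \ln\ln n} \ll n\|\delta\|^2$:

$$P_{H_1}(\tau \ge n) = 1 - P_{H_1}(\tau < n) = 1 - P_{H_1}(\exists t < n : T_t > q_t) \le 1 - P_{H_1}(T_n > q_n)$$

We have $q_n \asymp p_n \sqrt{\ln\ln V_n} \preceq p_n \sqrt{\ln\ln n}$; recalling that $p_n \asymp \sqrt{2V_n \log(1/\alpha)}$, we get

$$P_{H_1}(\tau \ge n) \preceq 1 - P_{H_1}\left( T_n \succ \sqrt{2V_n \log(1/\alpha)} \sqrt{\ln\ln n} \right)
= P_{H_1}\left( \frac{T_n - n\|\delta\|^2}{\sqrt{V_n}} \prec \frac{\sqrt{2V_n \log(1/\alpha) \ln\ln n} - n\|\delta\|^2}{\sqrt{V_n}} \right)$$

The above expression then corresponds to a tail inequality for the centered standardized random
variable on the LHS, which is a sum of bounded random variables, and hence standard sub-Gaussian
inequalities yield

$$P_{H_1}(\tau \ge n) \preceq \exp\left( -n^2 \|\delta\|^4 / V_n \right) = \exp\left( -n \, \frac{\|\delta\|^4}{8\,\mathrm{Tr}(\Sigma^2) + 8\,\delta^\top \Sigma \delta} \right) \qquad (24)$$

Recalling from Eq. (13) that

$$n^*_\beta \asymp \frac{8\,\mathrm{Tr}(\Sigma^2) + 8\,\delta^\top \Sigma \delta}{\|\delta\|^4} (z_\beta + z_\alpha)^2,$$

we infer from this and (24) that $P_{H_1}(\tau \ge n^*_\beta)$ is a small constant bounded away from 1.
Since $\tau = \sum_n I(\tau \ge n)$, we have by summing a geometric series that

$$\mathbb{E}_{H_1}[\tau] = \sum_{n \ge 1} P_{H_1}(\tau \ge n) \le n^*_\beta + \sum_{n \ge n^*_\beta} P_{H_1}(\tau \ge n)
\preceq n^*_\beta + \frac{P_{H_1}(\tau \ge n^*_\beta)}{1 - \exp\left( -\frac{\|\delta\|^4}{8\,\mathrm{Tr}(\Sigma^2) + 8\,\delta^\top \Sigma \delta} \right)}$$

Using the inequality $1 - \exp(-x) \le x$, i.e. $\exp(-x) \ge 1 - x$, and substituting for $n^*_\beta$, we get

$$\mathbb{E}_{H_1}[\tau] \preceq n^*_\beta + \frac{8\,\mathrm{Tr}(\Sigma^2) + 8\,\delta^\top \Sigma \delta}{\|\delta\|^4} P_{H_1}(\tau \ge n^*_\beta)
\asymp n^*_\beta + \frac{n^*_\beta}{(z_\beta + z_\alpha)^2} P_{H_1}(\tau \ge n^*_\beta)
= (1 + o(1)) \, n^*_\beta$$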
G Experimental Protocol


This section contains some notes on the experiments.

The graphs of Fig. [2] are each generated by 10,000 Monte Carlo trials. The remaining graphs
all use $\alpha = 0.05$.

The graphs of Fig. [3] are each generated by 1,000 Monte Carlo trials on the data, and the solid
lines are the resulting stopping time distribution of the sequential test. As for the dashed lines,
the true power of the batch test is also estimated by 1,000 Monte Carlo trials. Note that these
experiments are run with $N = 50000$, even though the tests always seem to stop much sooner
because the $\delta$ are sufficiently high. When data are abundant enough to detect any discernible
difference between the samples, we suggest setting $N$ very high, as this gives better power.

The graphs of Fig. [4] are each generated by 1,000 Monte Carlo trials.

Evaluating the dependence on dimensionality $d$ is outside our scope in this paper. The
high-dimensional properties of our statistic are further evaluated and discussed in [21], which shows that
it is possible to achieve better high-dimensional power with fewer samples than our test statistic.
But our standout contribution is sequential, and we focus on these aspects.

G.1 Supplemental Graphs
Type I Error

In Figure [2], we see that the cumulative type I error rate is increasing, not leveling off. To change this,
the proportionality constant on the iterated logarithm $C$ must be increased. The result of $C = 2.2$
is plotted to the right of Figure [5], with the $C = 2$ random walk of Figure [2] at left on the same scale
for comparison. We see that just a slight increase in $C$ lowers type I violations significantly; at
every $\alpha$, the type I error is less than half of the desired tolerance. Extrapolating the linear graphs,
we predict that type I error will be controlled up to the huge sample size $\sim e^{25} \approx 7.2 \times 10^{10}$ for
every $\alpha$, and further increases in $C$ make it infeasible to run for long enough to break type I error
control. This gives some empirical validation for our recommendations.

Figure 5: $P_{H_0}(\tau \le n)$ for different $\alpha$, on biased coin, for $C = 2$ (left) and $C = 2.2$ (right).

We can also look at the stopping time under the null with our simulated Gaussians. These show
better empirical concentration than the coin, unsurprisingly, so we do not graph them here.
Type II Error

For completeness, we give the equivalents to Figs. [3] and [4] here, as Figs. [6] and [7] respectively. Note
that the dependence of $\tau$ on $\delta$ in Fig. [7] is $O(1/\delta^2)$, as the theory for the coin predicts.

Figure 6: Power vs. $\ln(N)$ for different $\delta$, on Gaussians. Dashed lines represent power of batch test
with $N$ samples.

Figure 7: Distribution of $\ln(\tau)$ for $\delta \in \{e^{-c} : c \in \{1, 2, \dots, 5\}\}$, so that the abscissa values $\{\ln(1/\delta)\}$
are a unit length apart. Dashed line has slope 2.